Compression of Strings with Approximate Repeats
Abstract
We describe a model for strings of characters that is loosely based on the Lempel-Ziv model, with the addition that a repeated substring can be an approximate match to the original substring; this is close to the situation of DNA, for example. Typically there are many explanations for a given string under the model, some optimal and many suboptimal. Rather than commit to one optimal explanation, we sum the probabilities over all explanations under the model because this gives the probability of the data under the model. The model has a small number of parameters and these can be estimated from the given string by an expectation-maximization (EM) algorithm. Each iteration of the EM algorithm takes O(n^2) time and a few iterations are typically sufficient. O(n^2) complexity is impractical for strings of more than a few tens of thousands of characters, and a faster approximation algorithm is also given. The model is further extended to include approximate reverse complementary repeats when analyzing DNA strings. Tests include the recovery of parameter estimates from known sources and applications to real DNA strings.
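As a rough illustration of why the abstract sums over all explanations rather than keeping only the best one, the sketch below compares the two resulting code lengths in bits. The three probabilities are hypothetical toy values, not figures from the paper, and the function name `message_length_bits` is an assumption made for this example.

```python
import math


def message_length_bits(probabilities):
    """Code length in bits for data whose total probability is the sum
    of the probabilities of its individual explanations."""
    total = math.fsum(probabilities)
    return -math.log2(total)


# Hypothetical probabilities of three distinct explanations of one string.
explanation_probs = [2.0 ** -20, 2.0 ** -21, 2.0 ** -21]

# Committing to the single most probable explanation.
best_only_bits = -math.log2(max(explanation_probs))

# Summing over every explanation, as the abstract describes.
all_summed_bits = message_length_bits(explanation_probs)

print(best_only_bits, all_summed_bits)
```

Because the summed probability can never be smaller than the largest single term, the summed code length is never longer than the best-explanation code length. In this toy case the suboptimal explanations together contribute as much probability as the optimal one, so summing saves one bit.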
Related resources
Detection of Significant Patterns by Compression Algorithms: the Case of Approximate Tandem Repeats in DNA Sequences. Rivals
We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence-DN...
DNA Data Compression Algorithms Based on Redundancy
Carl Jung spoke of the 'collective unconscious', i.e. we are all connected to each other in some way or the other via our DNA. In most cases there are four bases in DNA: a (Adenine), c (Cytosine), g (Guanine) and t (Thymine). Each of these bases can be represented by two bits, since 2^2 = 4, i.e. a = 00, c = 01, g = 11 and t = 10 respectively, although this choice is arbitrary. So redundan...
A First Step Toward Chromosome Analysis
In this paper, we use Kolmogorov complexity and compression algorithms to study DOS-DNA (DOS: defined ordered sequence). This approach gives quantitative and qualitative explanations of the regularities of apparently regular regions. We present the problem of the coding of approximate multiple tandem repeats in order to obtain compression. Then we describe an algorithm that allows to find efficientl...
Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences
MOTIVATION: Compression algorithms can be used to analyse genetic sequences. A compression algorithm tests a given property on the sequence and uses it to encode the sequence: if the property is true, it reveals some structure of the sequence which can be described briefly; this yields a description of the sequence which is shorter than the sequence of nucleotides given in extenso. The more a se...
Repeats and Palindromes: an Overview
With a long text string like DNA, repeats and palindromes are not easily spotted. Yet finding such substrings is important; for instance, repeats in DNA are indicators of certain hereditary disorders and are used as genetic markers. We discuss repeats and then palindromes and then we relate the two. In our discussion of repeats, we first define an exact repeat and then five definitions of approximate r...
Journal title:
- Proceedings. International Conference on Intelligent Systems for Molecular Biology
Volume 6
Pages -
Publication date: 1998